Machine Learning & Non-Linear Models

Chris Bail
Computational Sociology
Duke University

Here Come the Machines...

 

Last class we noted a number of problems that plague analysis of large, complex datasets:

- What do p-values mean in the age of “big data”?
- How should we handle non-linear relationships between predictors and outcomes?
- What about causal complexity and rare events?

An Introduction to Machine Learning

 

Today we are going to talk about some exciting new (or relatively new) tools that help address some of these problems from the interdisciplinary field of “machine learning” (a.k.a. statistical learning), which grew out of statistics and computer science/engineering.

An Introduction to Machine Learning

 

We are not likely to see these models in ASR or AJS anytime soon because linear models are so deeply ingrained within the discipline. But I think these new methods will eventually make their way into our journals, and in the meantime they serve a very important purpose: improving understanding of our datasets before we analyze them using linear models.

An Introduction to Machine Learning

 

Most people use machine learning to make predictions: for example, data scientists use it to predict which types of advertising campaigns will work, and biostatisticians use it to predict who will live and who will die.

An Introduction to Machine Learning

 

We are going to use them in a slightly different way: we are going to employ machine learning techniques to better classify our data and identify complex relationships between predictors and outcomes. Once again, I will argue that we can use these methods to build better linear models, while we wait for sociology to catch up with the state of the art.

Machine Learning

 

We could easily spend the entire semester discussing machine learning. If you are curious about this topic, and especially if you want to examine the issue of prediction in greater detail, I recommend you check out this excellent online video course produced by two people who practically invented the field (at least within statistics):

http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/

Machine Learning

 

Also, if you want to review what we do today, check out this amazing visual tutorial about machine learning:

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Machine Learning

 

Instead of going in depth into each technique, I am going to do what I always do in this class: try to give you enough of an overview that you could pursue this subject in more detail on your own. As always, we will work through messy real-world examples that I hope will help you see the strengths and weaknesses of these new models.

Agenda for Today

 

1) Generalized Additive Models
2) Regression/Classification Trees
3) Random Forests

Bonus:
Guest lecture by Alex Hanna on supervised topic models!!!

GENERALIZED ADDITIVE MODELS

Generalized Additive Model (GAM)

 

One huge problem with many linear models is that they are parametric: they assume a fixed functional form (typically a straight line) for the relationship between every predictor and the outcome of interest. By contrast, GAMs are non-parametric, meaning that the shape of each predictor's function is determined by the data.

Generalized Additive Model (GAM)

 

We can of course perform non-linear transformations of variables to account for this problem, but to do so, we need to know that non-linearity is a problem in the first place.

Generalized Additive Model (GAM)

Even if we are careful and examine the linearity of the relationship between each predictor and the outcome in detail, such transformations are difficult to interpret: do you know what it means if a polynomial term is positive or negative in a regression model? What if there are multiple non-linear relationships within a model, as is often the case in social science data? To make matters worse, transforming multiple predictors within the same model can create collinearity issues.

Isn't There a Better Way?

 

Even if we can get around all of these problems, it would take us a very long time to do so…

Yes! Generalized Additive Models (GAMs)

 

Generalized Additive Models are a unique approach developed in 1986 by the statisticians Trevor Hastie and Robert Tibshirani. GAMs look a lot like traditional regression models, but they are much “smarter”: they automatically select the best transformation of each variable FOR YOU, and these transformations can be either non-linear or linear.

Generalized Additive Models (GAMs)

 

After “smoothing,” or transforming, each variable (where necessary), GAMs simultaneously estimate the relationship between each predictor and the outcome, outputting coefficients that look very similar to those of a regular linear model.

Generalized Additive Models (GAMs)

 

This not only gets us around the issue of having to try out different linear combinations of variables we have hunches about but also helps us account for potential non-linearities that we did not even know to look for.

Generalized Additive Models (GAMs)

 

GAMs can also support any type of link function, or in other words, any type of dependent variable that might be put into a generalized linear model (e.g. binary, continuous, etc.). They also have spiffy ways of dealing with missing data, without multiple imputation.
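For instance, fitting a logistic GAM only requires changing the family argument. Here is a minimal sketch (my own illustration, not from the lecture) using the built-in mtcars data as a stand-in, with the binary transmission variable am as the outcome:

```r
library(mgcv)

# Hypothetical example: a binomial (logistic) GAM with one smoothed predictor.
# mtcars and its binary "am" variable stand in for a real binary outcome.
logit_gam <- gam(am ~ s(wt), family = binomial, data = mtcars)
summary(logit_gam)  # smooth terms reported on the log-odds scale
```

The same pattern works for any family that glm supports (e.g. family = poisson for count outcomes).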

Generalized Additive Models (GAMs)

 


(Illustration by Kim Larson)

Generalized Additive Models (GAMs)

 

GAMs are kind of a compromise between conventional linear models, which are almost always biased but easy to interpret, and state-of-the-art machine learning techniques such as random forests, which are very good at representing/classifying relationships between variables but very difficult to interpret.

I think we will probably see GAMs in AJS and ASR before random forests. Note, however, that GAMs can be converted back into linear models, so this is what we may see in the near term.

Great- but how do we code GAMs?

 

Let's look at an example that was put together by Michael Clark at Notre Dame; the data are science test scores from the PISA cross-national education study:

data <- read.csv("http://www.nd.edu/~mclark19/learn/data/pisasci2006.csv")

The outcome we are going to examine is the “Overall Science Score” for each country in the dataset.

Great- but how do we code GAMs?

Let's take a look at the data:

library(car)
scatterplotMatrix(data)

PISA International Education Dataset

Coding GAMs

First, let's run a simple linear model

library(mgcv)
first_model <- gam(Overall ~ Income + Edu + Health, data = data)
summary(first_model)

Family: gaussian 
Link function: identity 

Formula:
Overall ~ Income + Edu + Health

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   121.18      78.97   1.535   0.1314    
Income        182.32      85.27   2.138   0.0376 *  
Edu           234.11      54.78   4.274 9.06e-05 ***
Health         27.01     134.90   0.200   0.8421    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.616   Deviance explained = 63.9%
GCV = 1212.3  Scale est. = 1119      n = 52

Coding GAMs

Now let's try a model that applies a smoother function to each predictor:

second_model <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data = data)

The s() here tells the gam function that we want to apply a smoother to each variable, i.e., that we want R to choose the appropriate transformation for each predictor in the model.

Coding GAMs

The mgcv package will also create plots that describe the relationship between each predictor and the outcome:

plot(second_model, pages = 1, residuals = TRUE, pch = 19, cex = 0.25,
     scheme = 1, col = '#FF8000', shade = TRUE, shade.col = 'gray90')

Nifty

Even More Nifty...

 

This “contour plot” lets us look at multiple variables at once:

vis.gam(second_model, type = "response", plot.type = "contour")

Even More Nifty...

The Nifty-est?

We can also do something called “tensor product smoothing,” which you can think of as a smoother that combines the smooths of multiple variables at once, in order to create a three-dimensional plot of the non-linear relationships between the predictors and the outcome.

third_model <- gam(Overall ~ te(Income, Edu), data = data)
vis.gam(third_model, type='response', plot.type='persp',
        phi=30, theta=30,n.grid=500, border=NA)

The Nifty-est?

Why aren't we all using GAMs?

 

Once again, I'm not sure why GAMs don't have more influence in sociology. At the very least, I encourage you to use them to triangulate your work with linear models.

REGRESSION/CLASSIFICATION TREES

What is a Regression Tree?

Illustration from James et al., “An Introduction to Statistical Learning,” p. 308

What is a Regression Tree?

Regression trees first separate data into “regions” using something called recursive binary partitioning. Essentially this means that the data are grouped into subsets of observations that have similar values on a predictor.

What is a Regression Tree?

Next, regression tree algorithms simply take the mean value of all observations within each “region” or “subset” and use this to make a prediction about the outcome.
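The idea can be seen in a toy base-R sketch (my own illustration, not from the lecture): split the built-in mtcars data into two hypothetical “regions” by weight and predict mpg using each region's mean:

```r
# Two toy "regions": cars lighter vs. heavier than 3 (in 1,000 lbs)
light <- mtcars$wt < 3

# The tree-style prediction for every car is simply its region's mean mpg
pred <- ifelse(light,
               mean(mtcars$mpg[light]),    # mean of the light region
               mean(mtcars$mpg[!light]))   # mean of the heavy region
```

A real regression tree does the same thing, except it chooses the splitting variable and cut point automatically, and keeps splitting each region further.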

Example of a Regression Tree

 

A brief example will help me illustrate this more clearly. We are going to create a regression tree using the tree function within the tree package and the built-in mtcars dataset:

install.packages("tree")
library(tree)
tree.cars <- tree(mpg ~ ., mtcars)

The mpg ~ . indicates we want to treat mpg as the outcome and look at every other variable in the dataset as a possible predictor.

Example of a Regression Tree

 

One of the great things about regression trees is that they are very easy to interpret- even for those who have no idea what regression is!

Regression tree visualizations resemble a kind of “flowchart” within an organization; indeed, they are often called “decision trees” for this reason. The main difference is that regression trees describe how different predictors combine to produce an outcome (in this case the miles per gallon of cars), instead of the hierarchy within a group.

Plotting a Regression Tree

The first line below plots the regression tree and the second line adds labels to the plot:

plot(tree.cars)
text(tree.cars, pretty = 0)

Behold... The Regression Tree

Interpreting a Regression Tree

 

The way to read these trees is to focus on the '<' sign: branches to the left mean “less than,” and branches to the right mean “greater than.” The wt variable here describes weight (recorded in 1,000 lbs in mtcars, so 2.26 means 2,260 lbs), and the tree shows us that this is a big factor in mpg, which is not surprising. In fact, all cars weighing less than 2.26 get about 30 mpg. Those heavier than 2.26 split into two further categories by number of cylinders, and among the eight-cylinder cars, those with the highest horsepower (hp) also have the lowest mpg.

Pruning a Regression Tree

 

Finding the best fit for the data usually requires “pruning” the tree, or removing branches that add more complexity but little explanatory power. To do this we use the prune.tree command (the cv.tree command can also help choose the best tree size via cross-validation):

prune.cars <- prune.tree(tree.cars, best = 5)
plot(prune.cars)
text(prune.cars, pretty = 0)

In this case, our tree was already so simple that pruning it did not add parsimony.

A Classification Tree

 

Let's run through a quick classification tree example. The difference between a regression and a classification tree is simply that the former is for continuous outcomes and the latter is for binary or categorical outcomes.

Instead of selecting the average value of the outcome within each subset of the dataset, a classification tree selects the most common value.

A Classification Tree

 

Let's use the Pew data from a previous class to identify patterns of support for the so-called “ground zero mosque” (the pew10 variable) according to education, sex, age, income, and party identification:

load("Pew Data.Rdata")
tree.groundzero <- tree(pew10 ~ educ + sex + age + inc + partyln, pewdata)
plot(tree.groundzero)
text(tree.groundzero, pretty = 0)

Behold... The Classification Tree

Time to Party

Another neat package for creating trees is the party package. In addition to the tree structure generated above, this package adds plots of the cases within each branch of the tree. It also uses a slightly different technique for partitioning regions in the data. Let's take a look:

install.packages("party")
library(party)
prettycartree <- ctree(mpg~., data=mtcars)
plot(prettycartree)

Time to Party

RANDOM FORESTS

Random Forests

 

The main downside of regression trees is that they are not the most accurate tool available. In fact, if all relationships between predictors and outcomes are linear, then a linear model will be more efficient.

Random Forests

 

The principal reason that regression trees are not the most efficient way of making predictions is that they only use one “round” of subsetting the data. As a result, regions of the data with high variance create problems.

Random Forests

 

But what if, instead of partitioning the data only once, we did it hundreds or even thousands of times in slightly different ways and then averaged the results? This is what is known as bootstrapping, and in the language of regression trees, it is often described as “bagging” (bootstrap aggregating).
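The resample-fit-average idea can be sketched in base R (my own toy illustration: lm stands in for a tree so it runs without extra packages):

```r
set.seed(123)  # for reproducibility

# Fit the same model to 500 bootstrap resamples of mtcars and collect
# each fit's predictions for the original observations
preds <- replicate(500, {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  predict(lm(mpg ~ wt + hp, data = boot), newdata = mtcars)
})

# "Bagged" prediction = the average across all resampled fits
bagged <- rowMeans(preds)
```

Bagging real trees works the same way; averaging over many resampled fits smooths out the high-variance predictions of any single fit.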

Random Forests

 

Random forests are an extension of bagging that includes a small tweak to “decorrelate” the bagged trees: at each split, the algorithm considers only a random subset of the predictor variables when choosing a branch. This random selection prevents one dominant predictor from overshadowing other predictors that have more moderate, yet still meaningful, associations with the outcome.

Random Forests

 

Unfortunately, when we extract a large number of subsets from the data using bagging or random forests, we sacrifice the interpretability of a single regression tree. That is, we cannot construct a tree that visualizes all of the bagged/randomly partitioned datasets.

Random Forests

 

On the other hand, we can still measure how important each predictor is, or how important each “branch” would be within a regression tree, by examining how much each split reduces the residual sum of squares.

Coding Random Forests

 

Let's try it out:

install.packages("randomForest")
library(randomForest)
set.seed(123)  # for reproducibility
rf.cars <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

Coding Random Forests

 

To create the variable importance plot, we use this line:

varImpPlot(rf.cars)

Coding Random Forests

The first plot shows how much error, measured by mean squared error, we would add to the model if we removed each variable.

Coding Random Forests

The plot on the right describes the total increase in “node impurity” if the variable is removed from the model.

Training and Prediction

Perhaps the best way to figure out how solid your predictions are is to use a training dataset, or subset of the data, to create the model, and then see how well it predicts outcomes in the held-out data. Unfortunately, I think this type of analysis will be too foreign to many reviewers in sociology, so I would not recommend presenting it as your main analysis. Once again, however, I do think it may be useful for enhancing your understanding of complex patterns within the dataset.
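A minimal base-R sketch of the idea (my own hypothetical model and split; a tree or forest would slot in the same way):

```r
set.seed(123)  # for reproducibility

# Hold out roughly 30% of mtcars as a test set
train <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))

# Fit on the training rows only (lm stands in for any model)
fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])

# Evaluate on the held-out rows: root mean squared prediction error
test_rmse <- sqrt(mean((mtcars$mpg[-train] -
                        predict(fit, mtcars[-train, ]))^2))
```

Error on the held-out rows is an honest estimate of predictive accuracy; error on the training rows alone would be overly optimistic.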

Boosting

The most recent advance in regression trees is the concept of boosting, which is similar to bagging but grows trees sequentially, with each new tree fit to the errors of the trees that came before it. If you want to learn more about these models, check out the book “An Introduction to Statistical Learning” and the gbm package.
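The core idea, that each new model fits the residuals of the ensemble so far, can be sketched in base R with small linear fits standing in for trees (my own toy illustration; the real gbm package grows actual trees):

```r
shrinkage <- 0.1                              # learning rate
pred <- rep(mean(mtcars$mpg), nrow(mtcars))   # start from the overall mean

for (i in 1:100) {
  res <- mtcars$mpg - pred                    # current residuals
  fit <- lm(res ~ wt, data = mtcars)          # weak learner fit to residuals
  pred <- pred + shrinkage * predict(fit, mtcars)
}

# The ensemble's error shrinks as boosting rounds accumulate
rmse <- sqrt(mean((mtcars$mpg - pred)^2))
```

The shrinkage parameter slows learning so that many small corrections accumulate, which is what makes boosted models both accurate and prone to overfitting if grown too long.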

QUESTIONS?

Next Week: Agent-Based Modeling

 

One of the most exciting parts of computational social science is our (relatively) newfound ability to simulate virtual societies. Such simulations can help us explore links between the micro and the macro, and emergent phenomena more broadly, and help us get a better sense of when and where our data and models are wrong. Agent-based models can also help us develop new theories to test on data from the real world. More broadly, agent-based models are another way of making sense of the highly complex datasets that you might collect as part of this class.